| Student | Matricola | Email |
|---|---|---|
| Protani Andrea | 1860126 | protani.1860126@studenti.uniroma1.it |
| Tromboni Gabriele | 2088799 | tromboni.2088799@studenti.uniroma1.it |
| Boesso Simone | 1800408 | boesso.1800408@studenti.uniroma1.it |
Process: suppose that \(X \sim \text{Unif}(0, 1)\), and suppose we independently draw \(\{Y_1, Y_2, Y_3, \ldots\}\) from yet another \(\text{Unif}(0, 1)\) model until we reach the random stopping time \(T\) such that \(Y_T < X\).
Question: it can be shown that the (marginal) PMF of \(T\) is such that \(\Pr(T = t) = \frac{1}{t(t+1)}\) for \(t \in \{1, 2, 3, \ldots\}\). Set up a simulation in R that implements the sampling scheme above in order to numerically check this result. Quantitatively check how the simulation size impacts the approximation. Make some suitable plot to support your comments and conclusions.
What was our idea for making the code as fast as possible?
\(T\) is the first instant at which \(Y_T < X\) happens, i.e. the first “trial” on which that condition occurs.
For this very reason, conditionally on \(X = x\), \(T\) is distributed as a Geometric with success probability \(x\), namely \((T \mid X = x) \sim \text{Geom}(x)\).
A geometric random variable returns the number of trials it takes for a certain condition to happen, which in our case is the waiting time \(T\) until \(Y_T < X\).
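Integrating the conditional geometric PMF against the uniform density immediately recovers the target marginal:

\[ \Pr(T = t) = \int_0^1 \Pr(T = t \mid X = x) \, \text{d}x = \int_0^1 x (1-x)^{t-1} \, \text{d}x = B(2, t) = \frac{\Gamma(2)\,\Gamma(t)}{\Gamma(t+2)} = \frac{1}{t(t+1)}. \]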
Our code builds on the idea above: we draw observations from a Geometric distribution whose success probability is itself a random draw from a \(\text{Unif}(0,1)\). This gives the random stopping time, which we increment by 1 because we want our counter to start from 1 and not from 0.
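As a minimal sketch of this idea (the simulation size \(\texttt{M}\) and the grid of \(t\) values below are free choices), one can draw \(T\) directly and compare the empirical frequencies with the target PMF:

```r
set.seed(123)
M <- 1e5                                    # simulation size (free choice)
# T | X = x ~ Geom(x); rgeom() counts failures before the success, hence the +1
t_sim <- rgeom(M, runif(M)) + 1
emp  <- as.numeric(table(factor(t_sim, levels = 1:10))) / M   # empirical PMF
theo <- 1 / ((1:10) * (2:11))                                 # 1 / (t (t + 1))
round(cbind(t = 1:10, empirical = emp, theoretical = theo), 4)
```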
For the largest \(\texttt{M}\) we also use an approach based on parallelization through the \(\texttt{future.apply}\) library, which speeds up the computation by a few tenths of a second.
Let's measure the median time the simulation takes for each \(\texttt{M}\).
library(future.apply, quietly = TRUE)
# M (the vector of simulation sizes, 1e2 to 1e7), N (the number of timing
# repetitions) and the vectors tempi, tempi_finali are initialized earlier.
cores = parallel::detectCores()
for (m in 1:length(M)){
  m1 = M[m]
  for (m2 in 1:N){
    set.seed(13112221)
    if (m == 6){                   # the largest M (1e7) is run in parallel
      plan(multisession)
      beg <- Sys.time()
      new_M = m1/cores             # split the work evenly across the cores
      t_data = rep(NA, cores)
      t_data = future_lapply(t_data, future.seed=T, function(m){
        # T | X = x ~ Geom(x); +1 so the counter starts from 1
        t = rgeom(new_M, runif(new_M)) + 1
        return (t)
      })
      tempi[m2] = Sys.time() - beg
    } else {
      beg = Sys.time()
      t_data = rgeom(m1, runif(m1)) + 1
      tempi[m2] = Sys.time() - beg
    }
  }
  tempi_finali[m] = round(median(tempi), 3)
}
tempi_finali
tempi_finali
## [1] 0.000 0.000 0.001 0.015 0.139 0.893
Let's now use some plots to compare our results with the theoretical distribution of \(T\) for each \(\texttt{M}\), namely \(\mathbb{P}(T = t) = \frac{1}{t(t+1)}\).
From the plots it can be seen that already with M=1000 the approximation is quite good, and from M=10000 onwards the true density and the estimated one practically coincide.
As a last step, let's also visualize how our code scales as \(\texttt{M}\) increases.
| Simulation Size M | Time |
|---|---|
| 100 | \(\approx\) 0 sec |
| 1,000 | \(\approx\) 0 sec |
| 10,000 | \(\approx\) 0.002 sec |
| 100,000 | \(\approx\) 0.015 sec |
| 1,000,000 | \(\approx\) 0.14 sec |
| 10,000,000 | \(\approx\) 0.91 sec |
To better appreciate the benefit of parallelizing the code, we increase the simulation size to M = 100,000,000.
Let’s first run the base code and check the time that it takes:
## Time difference of 14.862 secs
And now let’s implement the multiprocessing approach:
set.seed(13112221)
new_M = 100000000
cores = parallel::detectCores()
plan(multisession)
beg_mp <- Sys.time()
chunk = new_M/cores                 # observations generated by each worker
t_data = future_lapply(rep(NA, cores), future.seed=T, function(m){
  t = rgeom(chunk, runif(chunk)) + 1
  return (t)
})
fin_mp <- Sys.time() - beg_mp
round(fin_mp, 3)
## Time difference of 4.951 secs
So this shows that with a large \(\texttt{M}\) it is very useful to parallelize the code: this way it runs almost \(70\%\) faster (about 5 seconds instead of about 15).
Let's focus on the univariate case with \(d = 1\), so that the measurement space is the unit interval, \(\mathcal X = [0, 1]\). Assume also that the true density \(p_X(\cdot)\) behind your data \(X\) is \(\underline{\text{known}}\) and equal to a Beta(\(\alpha = 10, \beta = 10\)). In this part of the exercise you have to set up a simulation to compare the Mean Integrated Squared Error (\(\texttt{MISE}\), see below) between the true model \(p_X(\cdot)\) and its two approximations \(\hat p_{n, m}(\cdot)\) and \(\hat q_{\epsilon, m}(\cdot)\).
\(\texttt{MISE}(p_X, \hat p_{n, m}) = \mathbb{E}_{p_X} (\int_0^1 (p_X(x) - \hat p_{n, m}(x))^2 \text{d}x)\) = {\(\texttt{MISE}\) between the true model and the original histogram},
\(\texttt{MISE}(p_X, \hat q_{\epsilon, m}) = \mathbb{E}_{p_X, Q} (\int_0^1 (p_X(x) - \hat q_{\epsilon, m}(x))^2 \text{d}x)\) = {\(\texttt{MISE}\) between the true model and the privatized histogram}.
It is crucial to understand that here we are dealing with two sources of randomness: 1. the randomness due to \(\texttt{IID}\)-sampling from the population model \(P_X(\cdot)\); 2. the randomness due to the privacy mechanism \(Q(\cdot)\), which for us is the \(\texttt{IID}\)-sampling from the \(\texttt{Laplace}\). Consequently, for a generic transformation \(r(\cdot)\), the expectation \(\mathbb{E}_{p_X, Q}(\cdot)\) above should be parsed as
\[ \mathbb{E}_{p_X, Q}(r(Z_1, \cdots, Z_k)) \stackrel{\texttt{LLS}}{=} \int \bigg(\int r(z_1, ..., z_k) \text{d}Q(z_1, ..., z_k|x_1, ..., x_n) \bigg) \text{d}P_X(x_1) \cdots \text{d}P_X(x_n)\]
Once this is clear (and you \(\underline{\text{must}}\) ask questions if it's not!), the following are the relevant simulation parameters to try:
We initialize the parameters by choosing M = 1000 and m as a sequence from 5 to 50 with step 5.
First MISE: to estimate the first MISE we use \(\texttt{stepfun}\) to approximate the histogram, we define a function that computes the squared differences (i.e. the integrand), and we repeat the sampling and the corresponding integral computation \(\texttt{M}\) times, finally taking the empirical mean.
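The helper \(\texttt{square_diff}\) is defined earlier in the report and not shown in this chunk; a sketch consistent with how it is used below (note that it reads the current \(\texttt{step_fun}\) from the enclosing environment, a deliberate shortcut) could be:

```r
# Integrand of the MISE: squared difference between the true Beta(10, 10)
# density and the current histogram approximation; `step_fun` is looked up
# in the calling environment, as in the loop below.
square_diff <- function(x) (dbeta(x, 10, 10) - step_fun(x))^2
```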
# n, m, M, errors, MISE1 and square_diff are initialized earlier.
par(mfrow=c(2, 2))
for (n_rep in 1:length(n)){
  n1 = n[n_rep]
  for (i in 1:length(m)){
    m1 = m[i]
    for (j in 1:M){
      X = rbeta(n1, 10, 10)
      classes = seq(min(X), max(X), (max(X) - min(X)) / m1)
      p_hat = hist(X, breaks=classes, plot=F)
      step_fun = stepfun(p_hat$breaks, c(0, p_hat$density, 0))
      errors[j] = integrate(square_diff, 0, 1, subdivisions=1000)$value
      if (n1==100 & m1==25 & j==500){
        hist(X, breaks=classes, prob=T, col='cornflowerblue', border='white',
             main='Real density of a Beta(10, 10) vs\nthe histogram of the simulation',
             xlab='', ylab='Density', sub='Example with n=100 and m=25')
        grid()
        box()
        curve(dbeta(x, 10, 10), col=rgb(1,0,0,0.5), add=T, lwd=3)
      }
      if (n1==1000 & m1==25 & j==500){
        hist(X, breaks=classes, prob=T, col='cornflowerblue', border='white',
             main='Real density of a Beta(10, 10) vs\nthe histogram of the simulation',
             xlab='', ylab='Density', sub='Example with n=1000 and m=25')
        grid()
        box()
        curve(dbeta(x, 10, 10), col=rgb(1,0,0,0.5), add=T, lwd=3)
      }
    }
    MISE1[n_rep, i] = round(mean(errors), 3)
  }
}
plot(seq(5, 50, 5), MISE1[1,], type='o', pch=16, lwd=2, col='cornflowerblue',
main='MISE behavior with\nn=100 and different m',
xlab='m', ylab='MISE')
grid()
plot(seq(5, 50, 5), MISE1[2,], type='o', pch=16, lwd=2, col='cornflowerblue',
main='MISE behavior with\nn=1000 and different m',
xlab='m', ylab='MISE')
grid()
Second MISE: first we use the \(\texttt{rlaplace}\) function of the \(\texttt{VGAM}\) package to perturb the histograms; then we convert the new counts into densities, pass them to \(\texttt{stepfun}\), and proceed as in the first MISE.
# epsilon = c(0.1, 0.001) and the MISE2 container are initialized earlier.
par(mfrow=c(2, 2))
for (n_rep in 1:length(n)){
  n1 = n[n_rep]
  for (eps in epsilon){
    for (i in 1:length(m)){
      m1 = m[i]
      for (j in 1:M){
        X = rbeta(n1, 10, 10)
        classes = seq(min(X), max(X), (max(X) - min(X)) / m1)
        p_hat = hist(X, breaks=classes, plot=F)
        nu = rlaplace(m1, 0, sqrt(4/(eps^2)))   # Laplace noise with scale 2/eps
        q_hat = p_hat$counts + nu
        q_hat[q_hat < 0] = 0                    # clip negative perturbed counts
        bin_width = classes[2] - classes[1]
        q_hat_density = (q_hat/sum(q_hat))*(1/bin_width)
        step_fun = stepfun(p_hat$breaks, c(0, q_hat_density, 0))
        errors[j] = integrate(square_diff, 0, 1, subdivisions=1000, stop.on.error=FALSE)$value
        if (n1==100 & m1==25 & j==500 & eps==0.1){
          plot(step_fun, col='cornflowerblue',
               main='Approximation of the perturbed histogram\n vs the real density',
               lwd=3,
               xlab='x', ylab='Density', sub='Example with n=100, m=25 and epsilon=0.1')
          grid()
          box()
          curve(dbeta(x, 10, 10), col=rgb(1,0,0,0.5), add=T, lwd=3)
        }
        if (n1==1000 & m1==25 & j==500 & eps==0.001){
          plot(step_fun, col='cornflowerblue',
               main='Approximation of the perturbed histogram\n vs the real density',
               lwd=3,
               xlab='x', ylab='Density', sub='Example with n=1000, m=25 and epsilon=0.001')
          grid()
          box()
          curve(dbeta(x, 10, 10), col=rgb(1,0,0,0.5), add=T, lwd=3)
        }
      }
      if (n1 == 100 & eps == 0.1) MISE2$n100eps01[i] = round(mean(errors), 3)
      if (n1 == 100 & eps == 0.001) MISE2$n100eps0001[i] = round(mean(errors), 3)
      if (n1 == 1000 & eps == 0.1) MISE2$n1000eps01[i] = round(mean(errors), 3)
      if (n1 == 1000 & eps == 0.001) MISE2$n1000eps0001[i] = round(mean(errors), 3)
    }
  }
}
plot(seq(5, 50, 5), MISE2$n100eps01, type='o', pch=16, lwd=2, col='cornflowerblue',
     main='MISE behavior with\nn=100 and different m',
     xlab='m', ylab='MISE', ylim=c(0, 8))
grid()
lines(seq(5, 50, 5), MISE2$n100eps0001, lwd=2, col='coral2', type='o', pch=16)
legend('bottomright', c('eps=0.1', 'eps=0.001'), lty=1, lwd=3, col=c('cornflowerblue', 'coral2'))
plot(seq(5, 50, 5), MISE2$n1000eps01, type='o', pch=16, lwd=2, col='cornflowerblue',
     main='MISE behavior with\nn=1000 and different m',
     xlab='m', ylab='MISE', ylim=c(0, 8))
grid()
lines(seq(5, 50, 5), MISE2$n1000eps0001, lwd=2, col='coral2', type='o', pch=16)
legend('topleft', c('eps=0.1', 'eps=0.001'), lty=1, lwd=3, col=c('cornflowerblue', 'coral2'))
To choose the mixture we ‘played’ with \(\texttt{manipulate}\) until we found a shape we were satisfied with.
In the end we chose: \(A \sim \text{Beta}(17, 5)\), \(B \sim \text{Beta}(4, 10)\) and \(\pi = 0.3\).
So as a first step we initialize the parameters and we create the following functions:
- \(\texttt{p_X}\): the density of the mixture
- \(\texttt{rmixbeta}\): used to sample from the mixture
- \(\texttt{square_diff}\): the new, slightly modified, square loss
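The definitions of the first two functions are not shown in this chunk; sketches consistent with the chosen components \(A \sim \text{Beta}(17, 5)\), \(B \sim \text{Beta}(4, 10)\) and \(\pi = 0.3\) (function names as listed above, argument names assumed) could be:

```r
# Density of the two-component Beta mixture: pi * A + (1 - pi) * B
p_X <- function(x, pi = 0.3) pi * dbeta(x, 17, 5) + (1 - pi) * dbeta(x, 4, 10)

# Sampler: pick component A with probability pi, otherwise component B
rmixbeta <- function(n, pi = 0.3) {
  from_A <- runif(n) < pi
  ifelse(from_A, rbeta(n, 17, 5), rbeta(n, 4, 10))
}
```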
Plots of chosen betas, comparison between marginals and generated mixture
Estimation of the first MISE:
Estimation of the second MISE:
Finally we compute \(\texttt{MISE}(\hat q_{\epsilon,m}, \hat p_{n,m})\). We have to modify the loss function again because in this case it must compute the differences between the approximations of two histograms, so it receives two step functions as input.
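A sketch of this modified loss (the names below are hypothetical): it takes the evaluation point plus the two step functions, so the extra arguments can be passed through \(\texttt{integrate}\)'s \(\texttt{...}\):

```r
# Squared pointwise difference between two histogram approximations,
# each represented by a step function
square_diff2 <- function(x, f_q, f_p) (f_q(x) - f_p(x))^2

# usage sketch:
# errors[j] = integrate(square_diff2, 0, 1, f_q = step_fun_q, f_p = step_fun_p,
#                       subdivisions = 1000, stop.on.error = FALSE)$value
```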
We can see that in every run the MISE is lower when n=1000 than when n=100, so, as we expected, the sample size has a positive effect on the approximation.
One important thing to note is that as \(\epsilon\) decreases, information loss increases.
When n=100 the MISE increases steadily with m, and the best choice for m is 5, or at most 10, for both the simulated and the perturbed histograms.
When n=1000 we have two different behaviors:
- Non-perturbed: the best choice is m around 20
- Perturbed: the best choice, regardless of \(\epsilon\), is m equal to 5 or 10
If we now compare the MISE between the beta-mixture model and the single-beta one, we can observe that the behaviors are similar, but the mixture seems to yield lower errors.
The main goal of our analysis is to derive statistics from the daily habits of us young people.
We created a Microsoft Forms survey with three “simple” questions regarding three main topics of our age:
After receiving a fair number of responses to analyze, we noticed that the most satisfying and uniform data, and the data we were most interested in, were those concerning sexual habits.
To get an idea of what we are dealing with, let's plot a histogram and a summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1 1 5 9622 10 1000000
It is clear that there are some outliers that we need to remove…
Now it’s much better so we can move on.
Having received about a hundred responses, to get a fair trade-off between the number of bins and the observations per bin we divided our data into \(\texttt{m} = 10\) bins, so that each bin holds a satisfying amount of data. As for \(k\), the size of the dataset privatized through the perturbed-histogram approach, we chose \(80\): a high value that preserves the quality of the information received without actually exposing it in its entirety. Finally, for the \(\epsilon\) of the Laplace mechanism we chose \(0.1\), so as to perturb the data significantly without losing too much information; an even lower value, such as \(0.001\), would have made the perturbed dataset too dispersed and misleading compared to the original.
Let's normalize our data by dividing each element by the maximum, and plot a new histogram with 10 bins.
Let's apply the perturbation to our histogram in order to privatize the data. The information we would like to preserve after perturbing the data is that shown in the summary:
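The perturbation step mirrors the one used in the MISE simulation; a sketch (assuming \(\texttt{histogram}\) is the object returned by \(\texttt{hist}\) on the normalized data, and \(\texttt{rlaplace}\) comes from \(\texttt{VGAM}\)):

```r
library(VGAM, quietly = TRUE)   # provides rlaplace()

# `histogram` is the hist() object of the normalized data from the previous step
eps <- 0.1                                             # privacy level chosen above
nu <- rlaplace(length(histogram$counts), 0, 2 / eps)   # scale 2/eps = sqrt(4/eps^2)
q_hat <- histogram$counts + nu
q_hat[q_hat < 0] <- 0            # clip negative perturbed counts, as in the MISE part
```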
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.033 0.167 0.225 0.333 1.000
As a last step we sample \(k\) elements from our perturbed histogram; this sample will be the new dataset.
bin_width = classes[2] - classes[1]
q_hat_density = (q_hat/sum(q_hat))*(1/bin_width)
k = 80
from_bins = sample(x=c(1:length(histogram$counts)), size=k, replace=T,
prob=q_hat_density/sum(q_hat_density))
new_values = rep(NA, k)
for (i in 1:length(from_bins)){
  bin = from_bins[i]
  # draw uniformly within the selected bin
  new_values[i] = runif(1, histogram$breaks[bin], histogram$breaks[bin+1])
}
hist(new_values, prob=T, xlim=c(0, 1), ylim=c(0, 5), col='cornflowerblue', border='white',
main='Histogram of the new dataset compared\nto the perturbed histogram', xlab='Bins')
plot(stepfun(histogram$breaks, c(0, q_hat_density, 0)), add=T, col='coral2', lwd=2, lty=2)
box()
grid()
legend('topright', c('Perturbed histogram', 'Sample from the\nperturbed histogram'), pch=15, cex=1.3, col=c('coral2', 'cornflowerblue'))
Let's show the main statistics, comparing those of the perturbed histogram with those of the privatized sample itself. We can notice that the sample approximates the perturbed histogram quite well, but let's check how much information we have lost.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.033 0.167 0.225 0.333 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.062 0.177 0.280 0.531 0.859
So in the end the new dataset will have dimension:
## [1] 80 1
Taking a look of what there is inside…
## new_values
## 1 0.15164498
## 2 0.57715473
## 3 0.50994714
## 4 0.67650068
## 5 0.02090325
## 6 0.29925384
We decided to tackle the proof through a simulation that provides some evidence for a single \(\epsilon\); it is based on:
set.seed(69)
# n, m, epsilon, M and M2 are set earlier; `i` is taken from the outer loop below.
prova = function(j){
  tryCatch({
    X_samp = rbeta(n, 10, 10)
    new_X_samp = X_samp
    change = sample(1:n, 1)            # neighboring dataset: change a single entry
    new_Xi = rbeta(1, 10, 10)
    new_X_samp[change] = new_Xi
    classes1 = seq(min(X_samp), max(X_samp), (max(X_samp) - min(X_samp)) / m)
    p_hat1 = hist(X_samp, breaks=classes1, plot=F)
    p_hat2 = hist(new_X_samp, breaks=classes1, plot=F)
    nu = rlaplace(m, 0, sqrt(4/(epsilon^2)))   # same noise applied to both histograms
    q_hat1 = p_hat1$counts + nu
    q_hat1[q_hat1 < 0] = 0
    q_hat2 = p_hat2$counts + nu
    q_hat2[q_hat2 < 0] = 0
    bin_width = classes1[2] - classes1[1]
    q_hat1_density = (q_hat1/sum(q_hat1))*(1/bin_width)
    q_hat2_density = (q_hat2/sum(q_hat2))*(1/bin_width)
    step_fun_q1 = stepfun(p_hat1$breaks, c(0, q_hat1_density, 0))
    step_fun_q2 = stepfun(p_hat2$breaks, c(0, q_hat2_density, 0))
    if (i==69 & j==69){
      plot(step_fun_q1, col='cornflowerblue',
           main='Stepfunctions of the simulations', lwd=3,
           xlab='x', ylab='Density', sub='Example with n=100, m=25 and epsilon=0.1')
      grid()
      box()
      plot(step_fun_q2, col='coral2', add=T, lwd=2, lty=3)
      legend('topright', c('q_hat1', 'q_hat2'), lty=c(1,3), lwd=3, col=c('cornflowerblue', 'coral2'))
    }
    div = q_hat1/q_hat2              # ratio of the perturbed counts
    return (div)
  },
  error = function(e) NA)
}
for (i in 1:M2){
  for (j in 1:M){
    div2 = prova(j)
    division[j] = mean(div2[!is.na(div2) & !is.infinite(div2)])
  }
  condition[i] = max(division[!is.na(division) & !is.infinite(division)]) < exp(epsilon)
}
The condition was fulfilled 100 times out of 100, for an accuracy equal to 1.
## [1] 100
## [1] 1